Statistical Learning Lab 12 - Clustering

library(jsonlite)
library(ggplot2)
library(tidyverse)
library(dbscan)
library(mclust)
library(randomForest)
# Load the snake-shaped toy dataset and keep the two coordinate columns
df_snake = fromJSON('Snake_educlust.json')$data
df_snake$xpo = as.numeric(df_snake$xpo)
df_snake = df_snake %>% select(xpo, ypo)

# Load the hierarchical toy dataset
df_hier = fromJSON('Hierarchical_educlust.json')$data
df_hier$xpo = as.numeric(df_hier$xpo)
df_hier = df_hier %>% select(xpo, ypo)

Clustering

\(K\)-Means Clustering

The function kmeans() performs \(K\)-means clustering in R. We begin with a simple simulated example in which there truly are two clusters in the data: the first 25 observations have a mean shift relative to the next 25 observations.

set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

We now perform \(K\)-means clustering with \(K=2\).

km.out <- kmeans(x, 2, nstart = 20)

The cluster assignments of the 50 observations are contained in km.out$cluster.

km.out$cluster
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 2 2 2 2 2 2 2 2 2

The \(K\)-means clustering perfectly separated the observations into two clusters even though we did not supply any group information to kmeans(). We can plot the data, with each observation colored according to its cluster assignment.

plot(x, col = (km.out$cluster + 1),
    main = "K-Means Clustering Results with K = 2",
    xlab = "", ylab = "", pch = 20, cex = 2)

Here the observations can be easily plotted because they are two-dimensional. If there were more than two variables then we could instead perform PCA and plot the first two principal component score vectors.
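As a sketch of that idea (using simulated five-dimensional data, since the running example is two-dimensional; the object names here are illustrative, not part of the lab):

```r
# Sketch: cluster 5-dimensional simulated data, then visualize the
# cluster assignments on the first two principal component scores.
set.seed(1)
x5 <- matrix(rnorm(50 * 5), ncol = 5)
x5[1:25, ] <- x5[1:25, ] + 2          # shift the first 25 observations

km5 <- kmeans(x5, 2, nstart = 20)
pr5 <- prcomp(x5, scale. = TRUE)      # PCA on the same data

# Plot the first two score vectors, colored by cluster assignment
plot(pr5$x[, 1:2], col = km5$cluster + 1, pch = 20, cex = 2,
     main = "K-means clusters on the first two PCs")
```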

In this example, we knew that there really were two clusters because we generated the data. However, for real data, in general we do not know the true number of clusters. We could instead have performed \(K\)-means clustering on this example with \(K=3\).

set.seed(4)
km.out <- kmeans(x, 3, nstart = 20)
km.out
K-means clustering with 3 clusters of sizes 10, 23, 17

Cluster means:
        [,1]        [,2]
1  2.3001545 -2.69622023
2 -0.3820397 -0.08740753
3  3.7789567 -4.56200798

Clustering vector:
 [1] 3 1 3 1 3 3 3 1 3 1 3 1 3 1 3 1 3 3 3 3 3 1 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 2 2 1 2 1 2 2 2 2

Within cluster sum of squares by cluster:
[1] 19.56137 52.67700 25.74089
 (between_SS / total_SS =  79.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
plot(x, col = (km.out$cluster + 1),
    main = "K-Means Clustering Results with K = 3",
    xlab = "", ylab = "", pch = 20, cex = 2)

When \(K=3\), \(K\)-means clustering splits up the two clusters.

To run the kmeans() function in R with multiple initial cluster assignments, we use the nstart argument. If a value of nstart greater than one is used, then \(K\)-means clustering will be performed using multiple random assignments in Step 1 of Algorithm 12.2, and the kmeans() function will report only the best results. Here we compare using nstart = 1 to nstart = 20.

set.seed(4)
km.out <- kmeans(x, 3, nstart = 1)
km.out$tot.withinss
[1] 104.3319
km.out <- kmeans(x, 3, nstart = 20)
km.out$tot.withinss
[1] 97.97927

Note that km.out$tot.withinss is the total within-cluster sum of squares, which we seek to minimize by performing \(K\)-means clustering (Equation 12.17). The individual within-cluster sums of squares are contained in the vector km.out$withinss.
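As a quick sanity check (a small sketch, not part of the original lab code), the two quantities are related by a simple sum:

```r
# Sanity check: tot.withinss is the sum of the per-cluster withinss values
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

km.out <- kmeans(x, 3, nstart = 20)
all.equal(sum(km.out$withinss), km.out$tot.withinss)   # TRUE
```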

We strongly recommend always running \(K\)-means clustering with a large value of nstart, such as 20 or 50, since otherwise an undesirable local optimum may be obtained.

When performing \(K\)-means clustering, in addition to using multiple initial cluster assignments, it is also important to set a random seed using the set.seed() function. This way, the initial cluster assignments in Step 1 can be replicated, and the \(K\)-means output will be fully reproducible.

Let’s try \(K\)-means clustering on a more complicated dataset. Open questions:

  • How many clusters are in this dataset? (Hyperparameter choice)

  • How many can we expect k-means to detect?

df_hier %>% ggplot(aes(x = xpo, y = ypo)) + geom_point()

We can attempt to tune our clustering algorithm by observing the total within-cluster sum of squares over each level of the hyperparameter k. Keep in mind this is by no means a “correct” partition of the dataspace, but rather a tool to aid in decision-making.

tune_kmeans = function(data, kmax) {
  # Fit K-means for k = 1..kmax and record the total within-cluster SS
  result = numeric(kmax)
  for (i in 1:kmax) {
    km.out <- kmeans(data, i, nstart = 20)
    result[i] = km.out$tot.withinss
  }
  # Return a data frame ready for plotting
  data.frame(k = 1:kmax, within_ss = result)
}


df_viz = tune_kmeans(df_hier, 10)

df_viz %>%
  ggplot(aes(x = k, y = within_ss)) +
  geom_line() +  scale_x_continuous(breaks=seq(1,10,1))

set.seed(4)
km.out <- kmeans(df_hier, 6, nstart = 50)
plot(df_hier, col = (km.out$cluster + 1),
    main = "K-Means Clustering Results",
    xlab = "", ylab = "", pch = 20, cex = 2)

Hierarchical Clustering

The hclust() function implements hierarchical clustering in R. In the following example we use the simulated data from the \(K\)-means section to plot the hierarchical clustering dendrograms using complete and single linkage, with Euclidean distance as the dissimilarity measure. We begin by clustering observations using complete linkage. The dist() function is used to compute the \(50 \times 50\) inter-observation Euclidean distance matrix.

hc.complete <- hclust(dist(x), method = "complete")
hc.single <- hclust(dist(x), method = "single")

We can now plot the dendrograms obtained using the usual plot() function. The numbers at the bottom of the plot identify each observation.

par(mfrow = c(1, 2))
plot(hc.complete, main = "Complete Linkage",
    xlab = "", sub = "", cex = .9)
plot(hc.single, main = "Single Linkage",
    xlab = "", sub = "", cex = .9)

To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the cutree() function:

cutree(hc.complete, 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 2 2 2 2 2 2 2 2 2
cutree(hc.single, 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1

The second argument to cutree() is the number of clusters we wish to obtain. For this data, complete linkage separates the observations into their correct groups. However, single linkage identifies one point as belonging to its own cluster. A more sensible answer is obtained when four clusters are selected, although there are still two singletons.

cutree(hc.single, 4)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3
[39] 3 3 3 4 3 3 3 3 3 3 3 3
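For completeness, the third common linkage, average linkage, can be examined the same way; a minimal sketch on the same simulated data (not part of the original lab code):

```r
# Sketch: average linkage on the same simulated two-cluster data
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

hc.average <- hclust(dist(x), method = "average")
plot(hc.average, main = "Average Linkage", xlab = "", sub = "", cex = .9)
table(cutree(hc.average, 2))   # cluster sizes at a two-cluster cut
```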

Let’s try our hierarchical dataset again:

plot(hclust(dist(df_hier), method = "complete"),
     labels = FALSE, hang = -1, ann = FALSE)

set.seed(4)
hc.complete <- hclust(dist(df_hier), method = "complete")

df_viz = df_hier
df_viz$cluster = factor(cutree(hc.complete, 9))

df_viz %>% ggplot(aes( x = xpo, y = ypo, color = cluster)) + geom_point()

Using hierarchical clustering we get a nice view of the hierarchical nature of our data. The dendrogram clearly shows the order in which the communities are merged and can be used to detect nested communities. We can also observe that the model is quite susceptible to the random noise between the “expected” clusters.

DBSCAN

If we use a density-based approach the result changes slightly: the model does not help us understand the hierarchical structure of the data. Keep in mind that the algorithm directly uses the density of observations in the dataspace; a mix of high- and low-density regions makes a proper choice of epsilon difficult.

It is, however, much less susceptible to noise: the minPts parameter prevents the model from fitting arbitrarily small clusters, and points that belong to no dense region are labeled as noise (cluster 0) instead.
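Choosing eps by eye is error-prone. A common heuristic (an aside, not part of the original lab) is the sorted k-nearest-neighbor distance plot provided by the dbscan package: a “knee” in the curve suggests a reasonable eps. A sketch on simulated data:

```r
library(dbscan)

# Sketch: the sorted k-NN distance plot is a common heuristic for eps.
# A "knee" in the curve suggests a candidate value; k is often minPts - 1.
set.seed(1)
pts <- matrix(rnorm(200 * 2), ncol = 2)
kNNdistplot(pts, k = 3)
abline(h = 0.5, lty = 2)   # an illustrative candidate eps read off the plot
```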

eps <- 5      # neighborhood radius
minPts <- 3   # minimum number of points to form a dense region

db <- dbscan(df_hier, eps = eps, minPts = minPts)

df_viz = df_hier
df_viz$col = factor(db$cluster)
df_viz %>% ggplot(aes( x = xpo, y = ypo, color = col)) + geom_point()

Model Based Clustering - Gaussian Mixture Model

The Gaussian mixture model is, as the name implies, a model-based approach, meaning we can obtain uncertainty estimates for each data point and each cluster!

The “classification” vector stores the highest probability class of each observation. The “uncertainty” vector stores the related uncertainty of that classification.

gmm = Mclust(df_hier)
df_viz = df_hier
df_viz$col = factor(gmm$classification)
df_viz %>% ggplot(aes( x = xpo, y = ypo, color = col)) + geom_point()

gmm_uncertainty = data.frame(gmm$uncertainty)
gmm_uncertainty %>%
  ggplot(aes(x = gmm.uncertainty)) + geom_density()
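It is worth noting (as an aside, not part of the original lab code) that Mclust() selects the number of components G and the covariance model by BIC; a minimal sketch on simulated two-group data shows how to inspect that choice:

```r
library(mclust)

# Sketch: Mclust() picks both the covariance model and the number of
# components G by BIC; the chosen values can be inspected directly.
set.seed(1)
sim <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))
fit <- Mclust(sim, verbose = FALSE)

fit$G                      # number of components chosen
fit$modelName              # covariance structure chosen
plot(fit, what = "BIC")    # BIC curves across models and G
```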

z is an \(n \times G\) matrix that stores, for each observation, the probability of belonging to each of the \(G\) mixture components.

uncertainty_matrix = data.frame(gmm[["z"]])
head(uncertainty_matrix, n = 5)
            X1           X2           X3           X4           X5
1 9.999863e-01 1.025362e-56 3.861321e-21 1.371037e-05 1.370416e-49
2 9.999901e-01 2.107018e-60 5.143002e-24 9.894991e-06 3.416674e-51
3 9.999928e-01 4.353228e-66 1.303422e-28 7.183974e-06 9.689301e-54
4 3.425475e-27 9.999989e-01 9.836350e-29 1.090513e-06 1.875254e-60
5 1.308746e-26 9.999988e-01 1.575384e-27 1.182247e-06 1.262486e-59
             X6            X7
1  1.209802e-92 1.639829e-130
2  3.117873e-93 5.294943e-133
3  2.487486e-94 5.873509e-137
4 1.603307e-157 1.367950e-131
5 8.962820e-156 1.208672e-130
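Since each row of z is a posterior distribution over the \(G\) components, the rows sum to one. A small self-contained check on simulated data (not part of the original lab code):

```r
library(mclust)

# Sketch: each row of z is the posterior probability distribution of one
# observation over the G mixture components, so every row sums to 1.
set.seed(1)
sim <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
fit <- Mclust(sim, verbose = FALSE)

range(rowSums(fit$z))   # both endpoints are numerically 1
```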

Unfortunately we lose some flexibility with a model-based approach: the Gaussian assumption does not hold for many real-world datasets, and we cannot really detect arbitrarily shaped communities, as the model expects convex, ellipsoidal clusters.

gmm = Mclust(df_snake)
df_viz = df_snake
df_viz$col = factor(gmm$classification)
df_viz %>% ggplot(aes( x = xpo, y = ypo, color = col)) + geom_point()

Compare this to the previous models: hierarchical clustering identifies our expected communities when using single linkage, and DBSCAN can be tuned to do so as well.

set.seed(4)
hc.single <- hclust(dist(df_snake), method = "single")

df_viz = df_snake
df_viz$cluster = factor(cutree(hc.single, 4))

df_viz %>% ggplot(aes(x = xpo, y = ypo, color = cluster)) + geom_point()

eps <- 5  
minPts <- 3

db <- dbscan(df_snake, eps = eps, minPts = minPts)

df_viz = df_snake
df_viz$col = factor(db$cluster)
df_viz %>% ggplot(aes( x = xpo, y = ypo, color = col)) + geom_point()

Application - Young People Survey:

Let’s use our newfound skills to classify some people. Below we find survey data of 1000 people in Slovakia aged 15 to 30. We will try to detect distinct communities using clustering.

First we load the dataset and keep only the numeric variables (we could also encode the categorical ones).

responses = read.csv('responses.csv')
df_youth = select_if(responses, is.numeric)
df_youth = df_youth[complete.cases(df_youth), ]
set.seed(4)
hc.complete <- hclust(dist(df_youth), method = "complete")

plot(hc.complete, labels = FALSE, hang = -1, ann = FALSE)
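One caveat worth keeping in mind (an aside, not part of the original lab): dist() uses the raw variable scales, so variables with large ranges, such as height or weight, dominate the Euclidean distances relative to 1-5 Likert items. Scaling to unit variance first is often advisable; a sketch on a toy data frame (column names are illustrative):

```r
# Sketch: variables measured on larger scales (e.g. height in cm vs. 1-5
# Likert items) dominate Euclidean distances; scaling to unit variance
# puts all variables on an equal footing before clustering.
set.seed(1)
toy <- data.frame(likert = sample(1:5, 50, replace = TRUE),
                  height = rnorm(50, mean = 170, sd = 10))

hc.raw    <- hclust(dist(toy), method = "complete")          # height dominates
hc.scaled <- hclust(dist(scale(toy)), method = "complete")   # equal footing
```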

df_viz = df_youth
df_viz$cluster = factor(cutree(hc.complete, 4))

df_viz %>% ggplot(aes( x = Age,
                       y = Number.of.friends, color = cluster)) + geom_jitter()

This tells us very little as we have no idea which variables to visualize! There appear to be at least two interesting communities here but to learn more we need to go back to supervised learning.

# Treat membership in cluster 2 as a binary classification target;
# a factor response makes randomForest fit a classification forest
df_viz$Cluster2 = factor(ifelse(df_viz$cluster == 2, 1, 0))
df_viz = df_viz %>% select(-cluster)


rf.fit = randomForest(Cluster2 ~ ., data = df_viz, mtry = 20, importance = TRUE)

Interestingly enough, we see factors stereotypically associated with “outsiders” as most relevant for membership in the second cluster. We will continue here next week with PCA.

varImpPlot(rf.fit, n.var = 10)
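The raw scores behind varImpPlot() can also be pulled out with importance() and sorted programmatically; a minimal sketch on simulated data (the toy data frame and variable names are illustrative, not part of the lab):

```r
library(randomForest)

# Sketch: extract the importance matrix behind varImpPlot() and sort
# variables by mean decrease in Gini impurity.
set.seed(1)
toy <- data.frame(y = factor(rep(0:1, each = 50)),
                  a = c(rnorm(50, 0), rnorm(50, 2)),  # informative
                  b = rnorm(100))                     # pure noise
fit <- randomForest(y ~ ., data = toy, importance = TRUE)

imp <- importance(fit)
imp[order(-imp[, "MeanDecreaseGini"]), ]   # most relevant variables first
```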